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Abstract 


Markov chain Monte Carlo (MCMC) sampling is an important and commonly used tool 
for the analysis of hierarchical models. Nevertheless, practitioners generally have two options 
for MCMC: utilize existing software that generates a black-box “one size fits all” algorithm, 
or the challenging (and time consuming) task of implementing a problem-specific MCMC 
algorithm. Either choice may result in inefficient sampling, and hence researchers have become 
accustomed to MCMC runtimes on the order of days (or longer) for large models. We propose 
an automated procedure to determine an efficient MCMC algorithm for a given model and 
computing platform. Our procedure dynamically determines blocks of parameters for joint 
sampling that result in efficient sampling of the entire model. We test this procedure using 
a diverse suite of example models, and observe non-trivial improvements in MCMC efficiency 
for many models. Our procedure is the first attempt at such, and may be generalized to 
a broader space of MCMC algorithms. Our results suggest that substantive improvements 
in MCMC efficiency may be practically realized using our automated blocking procedure, or 
variants thereof, which warrants additional study and application. 
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1 Introduction 


Markov chain Monte Carlo (MCMC) has become a core computational method for many statistical 
analyses. These include routine Bayesian analyses, but also hybrid algorithms that use MCMC as 
one component, such as Monte Carlo Expectation Maximization (MCEM; Caffo, Jank, and Jones, 


2005) or data cloning (Lele, Dennis, and Lutscher, 2007). Nevertheless, the automated generation 


of black-box MCMC algorithms, as occurs in generally available software, does not necessarily re¬ 
sult in efficient MCMC sampling. Analysts are thereby accustomed to MCMC run times measured 
in minutes, hours or even days for large hierarchical models. Computation time is frequently the 
limiting factor, either limiting the range of models considered, or limiting the potential for per¬ 
forming diagnostics and comparisons using methods such as bootstrapping (Efron and Tibshirani, 


1994), cross validation (Gneiting and Raftery, 2007), or calibration of posterior predictive p-values 


(Hjort, Dahl, and Steinbakk, 2006), among others. Therefore, any widely applicable improvements 
to MCMC performance may greatly improve the practical analyses of large hierarchical models. 

Among the many MCMC sampling algorithms developed to improve MCMC efficiency, one of 
the most basic approaches has been block sampling: jointly updating multiple dimensions of a 


target distribution simultaneously (Roberts and Sahu, 1997; Sargent, Hodges, and Carlin, 2000). 
When one or more dimensions of the posterior distribution are correlated, joint sampling of these 
dimensions (with any variety of block samplers) can increase sampling performance relative to 


updating each dimension independently {e.g., Liu, Wong, and Kong, 1994). Despite wide recognition 
of the usefulness of this basic idea for designing efficient MCMC algorithms, there has been no 
automated method for choosing blocks to optimize - or at least greatly improve - performance. 
Here we develop such a method. 

Existing theoretical work comparing block samplers to univariate samplers (Mengersen and 


Tweedie, 1996 Roberts and Tweedie, 1996; Roberts, Gelman, and Gilks, 1997, among others) has 
provided many insights but falls short of providing a complete assessment of MGMG efficiency for 
several reasons. First, it uses MGMG convergence rates as the metric for comparison, without 
consideration of the computational demands of block sampling. Instead, our viewpoint is that any 
measure of MGMG efficiency must incorporate both the convergence rate and the computational 


3 













requirements of achieving improvements in convergence rate. This may give a different picture of the 
actual efficiency of a sampling algorithm. Accelerated convergence at an extreme computational cost 
is obviously not optimal. Second, the computational requirements of different steps will vary greatly 
across platforms, depending on such factors as processor, memory architecture, use of efficient linear 
algebra packages, etc. Therefore, even if theoretical comparisons were extended to incorporate 
aspects of computation, the best block sampling scheme would remain platform-dependent. It is 
important to recognize that computational costs affect not only the proposal step - such as the cost 
of generating a multivariate normal proposal - but also model computations and density evaluations. 
Some parts of hierarchical models may inherently involve expensive computations, which can impact 
the relative efficiency of different blocking schemes. Third, existing theories and methods presume 
that some wise, manual selection of blocks may be feasible, based for example on an understanding 
of the model structure, which leads to understanding which posterior dimensions may be correlated. 
In general, however, it is difficult to know a priori which dimensions will be correlated, which is 
one purpose of automating a procedure like MCMC in the hrst place. 

Here, we present a procedure for the automated exploration of MCMC blocking schemes, seeking 
a highly efficient MCMC algorithm specihc to the hierarchical model and computing environment 
at hand. This represents a higher level of automated algorithm generation than is provided by 
existing software, which serve to produce “one size hts all” MCMC algorithms. The family of BUGS 


packages (WinBUGS, JAGS, and OpenBUGS; Lunn et ah, 2000 Plummer, 2011; Lunn et ah, 2012) 
assigns samplers based on local characteristics of each model parameter, using a combination of 
Gibbs sampling, adaptive rejection sampling, slice sampling, and, in limited cases, block sampling. 


Other MGMG packages including ADMB (Skaug and Fournier, 2006) and Stan (Stan Development 


Team, 2014) use Hamiltonian MGMG sampling (Neal, 2011), which may generally be more efficient 
but nevertheless represents a static approach to MGMG algorithm generation. Yet other promising 


methods such as Langevin sampling (Marshall and Roberts, 2012) are not incorporated into software 
commonly used by practitioners. For simplicity, we restrict our attention to univariate and blocked 
adaptive random walk sampling. However, the main concept of exploring the space of parameter 
blocks to improve MGMG efficiency generalizes to allow the use of other sampling methods. 

In section]^ we examine the pros and cons of univariate versus block random walk sampling, both 
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in terms of algorithmic and computational efficiencies. From these considerations we conclude that 
the combination of samplers that yields optimal MCMC efficiency (defined as an MCMC algorithm’s 
rate of generating effectively independent samples) will be model- and platform-specihc. Section 
introduces a procedure for automated blocking of hierarchical model parameters, designed to 
maximize the resulting MCMC algorithm efficiency. The main idea of this procedure is to iteratively 
cluster model parameters based upon empirical posterior correlations, then intelligently subdivide 
the hierarchical clustering tree (a dendrogram) to determine blockings of parameters that result 
in efficient MCMC sampling. A series of simulated and real data examples given in section 
demonstrate that this procedure can improve MCMC efficiency many-fold over statically defined 
MCMC algorithms and can dynamically generate algorithms comparable in performance to model- 
specihc, hand-tuned algorithms. We close with a discussion in section 

2 MCMC Algorithms: Definitions and Efficiency 

In this section, we hrst dehne a space of valid MCMC algorithms. Then, we examine two domi¬ 
nant contributors to the efficiency of an MCMC algorithm: the algorithmic capability to produce 
effectively-independent samples from the target distribution, and the computational demands of 
the algorithm in generating MCMC samples; these are composed to form our metric of overall 
MCMC efficiency. Drawing upon existing asymptotic theory of MCMC sampling, the scaling of 
computational costs of different sampling schemes, and on several illustrative examples, we argue 
that achieving an optimally efficient MCMC algorithm for a specific model by pure theory is pro¬ 
hibitively difficult. That conclusion motivates our approach of computationally optimizing - or at 
least greatly improving - MCMC performance through automated exploration of a space of valid 
MCMC algorithms. 

2.1 MCMC Definition 

We assume a given, fixed, hierarchical model At, which may be represented as a directed acyclic 
graph where vertices represent top-level model parameters, latent states, or fixed observations 
(data), and edges represent dependencies between these components. Denote the set of all top-level 
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model parameters and latent states (the nnknown components for which we may seek inferences) 
as 0 = { 6 * 1 ,..., 9d}, which will be referred to as the parameters of Ai. 


An MCMC algorithm may be dehned in terms of its sampling scheme over 0. Let b be any 
non-empty snbset (“block”) of 0, and u E U he any valid MCMC sampling (or “npdating”) method 


snch as slice sampling or conjngate Gibbs sampling (see Gilks, 2005, for a broad overview of MCMC 
sampling methods). We define a valid MCMC sampler -0 = u{b) as the application of u to b, where 
b satishes any assnmptions of u {e.g., conjugacy). In addition to satisfying standard properties of 


Markov chains (see, for example, Robert and Casella, 2004), we define a valid MCMC algorithm 
as any set of samplers 4/ = {■^i,..., Ak}, where 'ipi = Uiipi) for i = 1 ,..., A;, satisfying = 0; 

that is, the MCMC algorithm npdates each model parameter in at least one sampler. We represent 
the chain of samples generated from snccessive applications of 4/ as .., where sample 

XO) implies model state 0 = is the set of initial valnes, X^ = (xf\..., X^*^), and 

Xfc = {X^°\X^^\ ...} is the scalar chain of samples of 9k (for k = 1,... ,d). 

This paper focuses attention on the restricted set of sampling methods Uq = {^scalar, Wbiock}, 
where ^scalar denotes univariate adaptive random walk Metropolis-Hastings sampling (hereafter. 


scalar sampling; Metropolis et ah, 1953 Hastings, 1970), and Ubiock denotes the multivariate gen¬ 


eralization of this algorithm (hereafter, block sampling; Haario, Saksman, and Tamminen, 1999), 
with the practical restriction that any ■0 = Mbiock( 6 ) satisfies | 6 | > 1. The Wscaiar algorithm adap¬ 
tively tunes the proposal scale, while Ubiock additionally tunes the proposal covariance (Roberts 


and Rosenthal, 2009). All scalar and block samplers asymptotically achieve the theoretically op¬ 
timal scaling of proposal distributions (and therefore acceptance rates, and mixing) as derived in 


Roberts, Gelman, and Gilks (1997), and implement adaptation routines as set out in Shaby and 


Wells (2011). 


For hierarchical model Ai with parameters 0, our studies focus almost exclusively on the set of 
MCMC algorithms Mi which contains all algorithms of the form 4/ = {■^i,..., V’fc}; where 'ipi = 
Uiipi), each Ui G Uq, and the sets bi form a partition of 0. We specifically name two algorithms in 
which are boundary cases. The first consists of d scalar samplers: Tgcaiar = {Pi, ■ ■ ■, Pd}, where 
each pi = Mscaiar( 6 'i). The second has a single block sampler for all parameters: Tbiock = {wbiock(0)}- 
Implicit is the assumption that each 9i is continuous-valued, which is the case throughout this paper. 
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2.2 Algorithmic Efficiency 


We first consider MCMC algorithmic efficiency, independent of any compntational reqnirements. 
This measnre of efficiency solely represents the best mixing, or eqnivalently the least antocorrelation, 
or the highest effective sample size, withont consideration for the compntational (time) reqnirements 
of generating a set of samples. After reviewing the dehnition of MCMC algorithmic efficiency which 
is based npon integrated antocorrelation time, we stndy the nse of Tgcaiar or Tbiock for particnlar 
choices of M, and qnantify the effects on this measnre of efficiency. 


As in Roberts and Rosenthal (2001), we dehne MCMC algorithmic efficiency as the effective 


sample size divided by the chain length. This represents the rate of prodnction of effectively 
independent samples per MCMC sample. The effective sample size (ESS) of an MCMC chain is 
dehned as ESS = N/t, where N is the chain length and r is the integrated antocorrelation time. 
For a scalar chain of samples Xq, Xi,.. which is assnmed to have converged to its stationary 


distribntion, Straatsma, Berendsen, and Stam (1986) dehne the integrated antocorrelation time as 


r = 1 + 2 cor(Xo, Xj). r may be interpreted as the nnmber of MCMC samples reqnired, on 
average, for an independent sample to be drawn. Onr measnre of algorithmic efficiency is thns r“^, 
the nnmber of effective samples per actnal sample (Thompson, [2010 ). also characterizes the 
speed at which expectations of arbitrary fnnctions of the sample valnes approach their stationary 


valnes (Roberts and Sahn, 1997), and no less satishes the natnral intnition that larger valnes indicate 
better performance. 

For MCMC algorithm T acting on model Ai with parameters 0, we dehne the algorithmic 
efficiency of each 0 G 0 as A{'^,6) = where r is the integrated antocorrelation time of the 
samples of 6 generated from repeated application of T. Overloading notation, we dehne the algo¬ 
rithmic efficiency of MCMC algorithm T as ^(T) = min^ge A(T, 6^). This dehnition is motivated 
by noting that often an MCMC prodnces seemingly good mixing of many model dimensions bnt 
poor mixing of jnst a few dimensions. In this case, the poorly mixing dimensions will limit the 
validity of the entire posterior sample (althongh this is not nniversally trne of all model strnctnres). 
Therefore, we take the conservative approach, and onr general aim is to maximize the algorithmic 
efficiency for the parameter exhibiting the slowest mixing. 
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We now study the potential for algorithmic inefficiency that may result from scalar or block 


sampling, by examining situations in which each are particularly ill-suited. 

Efficiency loss from block sampling 

Consider MCMC algorithm ft'block ^ ^m- Application of f^biock generates a random proposal 
vector X* G then jointly accepts or rejects X*. In the framework of the sampling method 
Wbiock, ~ cr^S), where fi and S vary according to current state of the MCMC chain and 

properties of the target stationary distribution, but not proportionally to d. Roberts, Gelman, and 


Gilks (1997) show that in order to achieve the asymptotically optimal acceptance rate, and therefore 
sample chain mixing, aj oc d~^. As a consequence of this attenuation in the proposal variance, the 
rate at which ft' block explores the space of M'^, and accordingly A(f['biock), is proportional to d~^. 
This result applies equivalently to block samplers ■0 = Mbiock(^) acting on subsets 6 C 0, where the 
algorithmic efficiency (for the elements of b) achieved by application of if is inversely proportional 
to the number of elements in b. 

This result has an important implication on block sampling. All other factors being equal {e.g., 
effect of posterior correlations), a block sampler of dimension k must generate k times more samples 
to have the same effective sample size as those samples produced through scalar sampling (Roberts 


and Rosenthal, 2001). This inefficiency is inherent to block sampling and scales with the dimension 


of any block sampler. 


Efficiency loss from scalar sampling 

To study the potential loss of algorithmic efficiency which may result from scalar sampling under 
fl^scaiar ^ ^Mi we cousider correlated posterior distributions. It is well-understood that strong 
posterior correlations can retard the speed of convergence of MGMG sampling {e.g., Roberts and 


Sahu, 1997 Gilks, 2005), and that block sampling can alleviate this. However, the nature of 


this inefficiency is not characterized, in particular as a function of the degree of correlation and 
number of dimensions exhibiting correlation. We undertake a simulation study, to gauge how these 
factors effect algorithmic efficiency. Gonsider target distribution Nd(0,E), where the covariance 
(equivalently, correlation) matrix S consists of Is on the main diagonal and p elsewhere, |p| < 1. 






We consider the algorithmic efficiencies of individnal model parameters under scalar sampling, 

^('hscalar, 0) for 0 G 0 . 

Empirically, we observe that each y4(\hscalar, 0) tends toward zero as p approaches one, or as d 
diverges (p 7 ^ 0). The nature of these relationships is characterized on a log-log scale in Figure 
(left), where the horizontal axis plots — log(l — p), such that positive horizontal shifts represent p 
exponentially approaching the boundary p = 1 , or perfect correlation between parameters. As one 
would expect, when p = 0 all values of d yield identical algorithmic efficiency. However, when p > 0 
we enter a linear regime where each A(Tscaiar, Q) exponentially decays towards zero. Even for hxed 
d, algorithmic efficiency under Tgcaiar can be arbitrarily poor as p approaches unity. 



I I I I I I 

0 12 3 4 5 
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Figure 1: MCMC algorithmic efficiencies for different values of model dimension (d), and intra¬ 
group correlation (p). The quantity — log(l — p) is plotted on the horizontal axis. Model structures 
include constant off-diagonal elements (equal to p) in the induced correlation matrix (left), and 
exponentially decaying correlations (right). 


It may be extreme to assume a target distribution with arbitrarily large blocks of parameters that 
exhibit arbitrarily high pairwise-correlation. As an alternative, we consider the same multivariate 
normal form, but with elements of S given as cTjj = |p| < 1 , as might occur in the covariance 


structure of a spatial model (Banerjee, Carlin, and Gelfand, 2003, p.27). The algorithmic efficiency 
of the hrst model parameter - since elements of 0 are no longer exchangeable - is shown in Figure 
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(right), where it displays similar attenuation as in the prior example. Most notably, we now observe 
the incremental effect of d diminishing as d increases, consistent with the covariance structure. 

Through a combination of theory and simulated examples, we observe that the algorithmic 
efficiency achieved under Tgcaiar or Tbiock will depend non-trivially upon the model dimension, 
and the extent and structure of the posterior correlation. We have only considered two simple, 
highly modular and systematic posterior correlation structures. In practice, any model of interest 
will demonstrate a substantially more complex correlation structure (which is unknown, anyway), 
making a full analytical study of MCMC algorithmic efficiency difficult. No less, we have only 
considered the two boundary-case algorithms Tgcaiar and Tbiock in ^m- Onr desire to derive general 
results for algorithmic efficiency will be complicated substantially further when we consider the 
complete set ^m- 


2.3 Computational Efficiency 


While section |2.2 considered the efficiency of MCMC algorithm T at producing independent sam¬ 
ples without regard for computation time, we now consider the computational requirements of T, 
measured in units of algorithm runtime per MCMC iteration. We focus on computations for model 
density evaluations. There are also book-keeping and loop-iteration costs, which we label as algo¬ 
rithm overhead. These overhead terms comprise a small fraction of total computation, and we safely 
disregard them. Denote the computational requirement of T as C'(T), and again overload notation 
to define C(V’) as the computational requirement of a single sampler ijj. For T = {'ipi,... ,'ipk}, 
C(^') = aware, an analysis of MCMC efficiency which incorporates 

C(T) has not been carried out to date. Literature which does address MCMC efficiency typically 
recognizes that a computational aspect exists, but then focuses solely on A(T), e.g.^ Roberts and 


Sahu (1997). 


We now present an accounting of the main contributions to (^(Tgcaiar) and C'(\['biock), general 
for any M.. To support our accounting, we denote the set of all fixed and known components 
oi M. {e.g., observations, other data) as Y, which is disjoint from the unknown set of parameters 
0 = {9i ,..., 9d}. For each 9i G 0, let Xj C 0UF be the set of model components which immediately 
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depend on 9i, or the set of direct descendants of 9i in the model graph introduced in section 2.1 


Finally, we denote the computational cost of evaluating the density functions corresponding to any 
subset X G Q UY as dens(a;). 


Scalar Sampling Computation 

On each iteration of d'scalar = {'ipi, ■ ■ ■ y'lpd}, each sampler 'ipi will incur computational cost = 

dens(6'j) + dens(a;j) + 0(1). The trailing constant term represents generation of the proposal value 
and making the MH decision (generation from normal and uniform distributions, respectively). The 
total computational requirement of Tscaiar is thus 

d d d 

^scalar) ^ ^ ^^("00 E dens(6'j) + dens(a;j) + 0{d). 
i=l 2=1 2=1 

Note that under Tgcaiar, each density evaluation dens(6*j) must occur independently. This is true 
even when the evaluation of a particular dens(6'j) term necessitates the calculation of a subsuming 
multivariate density - in the most extreme case, dens(0). Thus, in the worst case, a single MCMC 
iteration of Tscaiar could incur d evaluations of the entire joint density of 0. A similar computational 
explosion can result from the calculation of each dens(a:j) term. 


Block Sampling Computation 


We now consider the components of C(Tbiock)- On each iteration of Tbiock, the sole sampler Mbiock(0) 
requires evaluation of the complete prior and likelihood densities, dens(0 U Y). This is notably dif¬ 
ferent from the density evaluation terms appearing in (^(Tscaiar); in that it incurs only a single 
evaluation of the complete joint model density. In addition, the adaptation routine of Mbiock(0) 
requires a Cholesky factorization of the adapted covariance matrix, which requires 0{d^) operations 
to calculate in full generality (Trefethen and Ban, 1997| p.l76). This factorization occurs every AI 
iterations, where AI is the adaptation interval of Wbiock, and therefore has an amortized computa¬ 
tional cost of 0{d^/AT). Each iteration of Wbiock requires generating a d-dimensional multivariate 
normal proposal, which requires 0{d‘^) operations, and performing a single constant-time MH de¬ 
cision, which is dropped as a lower-order term. The total amortized computational requirement of 
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^biock may be written as 


C(^biock) = dens(0 U F) + 0{d^/M) + 0{d^). 


Timing Comparison 

We have seen that ^(tl/block) is at least qnadratic in d, and technically cnbic bnt with a small lead¬ 
ing coefficient. Depending on the distribntional strnctnre of 0, the density evaluations comprising 
C(^scaiar) may be unwieldy. The relative magnitude of these competing terms is difficult to intu¬ 
itively gauge, so to gain practical insight, we perform a timing study of the Tscaiar (All Scalar) and 
'hbiock (All Blocked) algorithms. Three models involving no likelihood components are considered, 
with prior structures on 0 given as: 

• 6*j ~ N(/i, a) for each 9i E Q 

• 9i Gamma(a;, fd) for each 9i E Q 

. 0~Nrf(/i,S) 

Figure 1^ presents timing results measured in seconds per 10,000 iterations, plotted against di¬ 
mension d, without consideration of algorithmic efficiency (section [2.2[ ). There are a number of inter¬ 
esting features, which we briefly summarize. C(Tscaiar) is 0(d) when each d* independently follows a 
univariate distribution, and therefore dens(dj) = dens(0) < d-K, where K = max^ge dens(d). 
For all practical purposes, it appears that 0(\kbiock) is 0{d?), as the cubic coefficient 1/AI = 1/200 
is relatively small. C'(Tbiock) is largely resilient to changes in the underlying distribution of 0; 
univariate gamma densities are more costly than normal densities, as we would expect, and as for 
O(Tscaiar); and the multivariate normal structure even slightly more so. Perhaps most striking, 
O(Tscaiar) is O(d^) wheu the underlying distribution of 0 is multivariate, since each multivariate 
normal density evaluation is O(d^), which occurs d times for each iteration of Tscaiar- Although both 
are cubic in d, C'(Tscaiar) dwarfs (^(Tbiock) due to the difference in the leading cubic coefficients. 
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Figure 2: MCMC runtimes for the All Scalar and All Blocked algorithms, for univariate normal, 
univariate gamma, and multivariate normal model structures. 

2.4 Overall Efficiency 

We have examined both the algorithmic and computational efficiency of MCMC algorithms, each 
of which fundamentally affect overall MCMC efficiency. We dehne the overall efficiency of d' simply 
as F'(d') = A(\[')/C(\I/). We consider this to be a sensible measure of the overall efficiency of an 
MCMC algorithm, since F'(\l') may be interpreted as the number of effective samples produced per 
unit of MCMC algorithm runtime, for the slowest mixing model parameter. If we can construct 
to maximize then is the most time-efficient MCMC algorithm for generating effectively 

independent samples to approximate the true, joint posterior distribution of 0. 

That being said, from our examination of algorithmic and computational efficiency, it is not 
immediately clear how to balance tradeoffs between ^(T) and C'(T) to maximize B(^). We have 
generally considered the two boundary-case algorithms Tgcaiar and Tbiock, but a huge number of 
intermediate algorithms exist. For T G we may gain useful insights regarding the factors 

affecting B(^) in terms of the properties of each 'ipi. C(T) = and the values A(T, 9i) 

which determine A(\h) each result from a single application of scalar or block sampling - although 
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this neglects the phenomenon where improving v4('h, 6i) may affect v4('h, 6j), i ^ j. However, hnding 
a global optimum tl'opt = argmax^g^^ii^(\h) poses a combinatorial challenge. Instead of seeking 
'hopt, we now propose an iterative procedure to navigate Mi with the aim of maximizing E{'$) to 
the degree possible. 

3 Automated Blocking 

In this section, we propose an iterative, self-tuning procedure for automated blocking of hierarchical 
model parameters to produce an efficient MCMC algorithm. This procedure uses the empirical 
posterior correlation to cluster groups of correlated parameters into sampling blocks. A hierarchical 
clustering tree of model parameters is constructed, and subsequently cut at some height (selected 
using a finite search) to produce parameter groups each exhibiting a minimal intra-group posterior 
correlation. This procedure is iterated, so that as MCMC efficiency improves, the empirical posterior 
correlations are more accurate, and the resulting tree and parameter groups stabilize. The end- 
result is a partition of the model parameters, which uniquely specihes an MCMC algorithm T G 
employing scalar and block sampling, for which the overall efficiency i?(\k) (section [2.4[ ) is increased 
to the degree possible. We also discuss more sophisticated approaches, but our heuristic approach 
allows huge efficiency gains in some cases and establishes the basic procedure. 

3.1 Procedure 

Assume a given, hxed, hierarchical model Ai, with parameters 0 = {0i,... ,0d}- Algorithm]^ 
presents pseudocode for our automated blocking procedure, which produces MCMC algorithm 
'hAutoBiock £ ^M- Subscripting indices j and k are understood to take all values in 1,..., d. 

The procedure begins with the initial MCMC algorithm Tq = Tscaiar, or scalar sampling of all 
model parameters; lacking prior information, this is used as the starting point for gaining insight 
about the posterior correlation structure. Subsequent iterations are based upon the empirical 
posterior correlation produced in the previous iteration, and, to a varying degree, will institute 
blocks for parameter groups exhibiting sufficiency high intra-group correlations. 

Prior to calculating the empirical correlation terms pj^k, we discard the seemingly excessive and 
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Algorithm 1 Automated Blocking 
1; i 0 

2: d'o ^ d^scalar 
3: loop-. 

4: i ^ i + 1 

5: Generate samples of 0 under where Xj represents the sample chain of 9j 

6 : Discard initial 50% of each chain Xj 

7: ^ cor(Xj, Afc) 

8- ^j,k ^ 1 

9: Construct distance matrix D from elements dj^k 

10 : Construct hierarchical clustering tree T from D 

11 : ^cand ^ {^(^cut=o)) ^(^cut=0.l), ^(^cut=0.2); • • • , lll(^cut=l)} 

12: ^ argmaxq,g^^^^^E(d') 

13: if E{'^i) > and 'hj % then goto loop 

14: lI/AutoBlock ^ llli 
15: retnrn d'AutoBiock 


somewhat arbitrary initial 50% of all samples. This should not be confused with a traditional “burn- 
in,” whose purpose is to “forget” the initial state and ensure convergence to the target distribution. 
Instead, discarding these initial samples allows all adaptive scalar and block samplers ample time 
to self-tune, and thereby achieve their theoretically optimal algorithmic efficiency. The choice of 
50% is largely arbitrary, and excessive in most cases, and could almost certainly be relaxed without 
affecting algorithm performance. 

Empirical correlations are transformed into distances using the transformation dj^k = 1 — 

The form of this transformation is selected to induce several properties for elements of the distance 
matrix D-. the main diagonal consists of zeros; strong correlation results in d 0; weak or zero 
correlation results in d ^ 1; and correlations of p and —p result in the same distance. 

We use the R function hclust to create the hierarchical tree T from the distance matrix D. The 
default “complete linkage” clustering (Everitt, [2011 chapter 4) is appropriate, since this ensures 
that all parameters within each cluster have a minimum absolute pairwise correlation. At height 
h G [0,1] in T, the absolute correlation between parameter pairs (within clusters) is at least 1 — h. 

We use the R function entree for cutting the hierarchical clustering tree T at a specified height 
h G [0,1] to produce disjoint parameter groupings, which may be used to define parameter blocks for 
the purpose of MCMC sampling. We justify this means of generating parameter sampling blocks, 
insofar as to increase algorithmic efficiency we strive to group correlated parameters into sampling 
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blocks - the exact effect of cutting T at any particular height. 

We define the MCMC algorithm '^{Tcut=h) G as the unique MCMC algorithm defined by 
scalar and/or block sampling applied to the parameter blocks that result from cutting T at height 
h. We note that for all T, d'(Tcut=o) = ^scalar, and d'(Tcut=i) = ^biock- 

There is no universally optimal value of the cut height h, as our findings in section [pimply that 
any h G [0,1] may maximize the efficiency T^(\I/(Tcut=/i)) for a particular model Ai. We recognize 
that a combination of distinct cut heights applied to different branches of T may produce the 
maximal efficiency, but we do not consider such strategies herein. 

Rather than attempting to infer what might be an appropriate cut height for model Wl, we 
consider a range of potential cut heights, and the resulting MCMC algorithms. These comprise 
^cand, the candidate set of MCMC algorithms for a particular iteration. This approach allows the 
blocking procedure to adjust according to the model, the posterior correlation structure, and the 
underlying computational architecture. The MCMC algorithm for each iteration {i > 1) is selected 
from among ^'cand as that which produces the maximal overall efficiency. 

To estimate the integrated autocorrelation time, and hence the algorithmic efficiency, of a chain 
of MCMC samples, we use the effectiveSize function in the R coda package (Plummer et ah. 


2006). The approach underlying this function - using a fitted autoregressive model to estimate the 


spectral density at frequency zero - has been seen to converge fastest among several methods to 


the true integrated autocorrelation time (Thompson, 2010). 

As R'('hj) increases through the course of iterations, improved estimates of the posterior correla¬ 
tion are produced, giving the potential for more refined parameter blockings, and thus progressive 
increases in T^(Tj) in subsequent iterations. This iterative procedure continues until either (1) 
Tj = Tj_i (identical algorithms are selected on consecutive iterations), or (2) E{^i) < ii^(Tj_i) 
(efficiency decreases between iterations). In practice, our procedure typically halts with terminat¬ 
ing condition (1). This may be concurrent with terminating condition (2), on account of stochastic 
variation in sampling and/or runtime. 

We select the output from our automated blocking procedure as 'hAutoBiock, the MCMC algorithm 
selected in the final iteration. In our experience, TAutoBiock is typically identical to the MCMC algo¬ 
rithm of the second-to-last iteration; that is, the procedure has converged to a stationary state. If a 
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situation arises where the hnal iteration produces a different MCMC algorithm with efficiency infe¬ 
rior to that of the previous iteration, then prudence would suggest a thoughtful examination of the 
posterior samples, empirical correlation matrices, properties of the adapted samplers, convergence 
diagnostics, etc. 


4 Automated Blocking Performance 

We now compare the performance of MCMC algorithms produced using the automated blocking 
procedure of section against various static MCMC algorithms. First, we describe the computing 
environment in which our analyses are performed. We then describe a broadly representative suite 
of example models, and present the performance results of automated and static MCMC algorithms 
for each. A public Github repository containing scripts for reproducing our results may be found 
at https://github.com/danielturek/automated-blocking-examples. 


4.1 Computing Environment 


Since one of our points is that optimal design of MCMC algorithms depends on the computing 
environment, we briefly summarize the software tools and computing platform used. All statistical 
models and MCMC algorithms were built using the NIMBLE package (NIMBLE Development 


Team, 2014) for R (R Core Team, 2014). NIMBLE allows hierarchical models to be defined within 


R using the BUGS model declaration syntax introduced by the BUGS project (Lunn et ah, 2000 


Lunn et ah, 2012). MCMC algorithms in NIMBLE are written using NIMBLE’s domain specific 


language for specifying hierarchical model algorithms. This language is an enhanced subset of R 
(interfaced through an R session) which is compiled into C-|—I- code, which is subsequently compiled 
and run. 

As a result, the examples here use highly efficient code generated automatically for each model 
and algorithm. Of particular importance is that matrix operations are done via the highly optimized 


C-|--|- Eigen library (Guennebaud and Jacob, 2010). Finally, the high-level programmability provided 
by R facilitated the dynamic exploration of MCMC algorithms. Examples were run using R version 
3.1.2, using the BLAS (Basic Linear Algebra Subprograms) provided by R for multivariate density 
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calculations and simulation, and running under Macintosh OSX version 10.9.5 on a 2.5 GHz Intel 
Core i7 processor. 

4.2 Model Descriptions 

We tested the automated blocking procedure on seven examples, including real-data and toy models, 
and compared the results against standard MCMC algorithms. When there are obvious “hand- 
tuned” algorithms that a seasoned MCMC practitioner would consider for a particular model, 
we included those as well. For the toy models, the goal was to construct posterior distributions 
with specihc correlation structures as described below. In these cases the models are simply prior 
distributions without any likelihood component. 

Varying Size Blocks of Fixed Correlation 

This model structure demonstrates parameter groups of varying size, where each group exhibits a 
hxed intra-group pairwise correlation. The model contains iV = 64 parameters, half of which are 
grouped to have pairwise posterior correlation of p. This is accomplished using a prior multivariate 
normal distribution with appropriate covariance (equivalently, correlation) matrix, which in the 
absence of a likelihood term fully determines the posterior distribution for these 32 parameters. 
Similarly, additional disjoint groups of correlated parameters are constructed of sizes 16, 8, 4, and 
2. The remaining two parameters are uncorrelated to any others, specihed using univariate normal 
prior distributions. We consider three values for the intra-group correlation, p = 0.2, 0.5, and 0.8. 
As these models have no likelihood, we are using MCMC to sample from the prior distribution. We 
note that the dependence structure is the same as the block-diagonal covariance structure (with the 
blocks having compound symmetry) obtained when analytically integrating over the exchangeable 
prior means of clustered random effects. This example thereby mimics the structure found in basic 
multilevel hierarchical models, albeit without the explicit computational expense of a likelihood 
calculation. 
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Fixed Size Blocks of Varying Correlation 

The next model structure exhibits hxed size groupings of parameters, with posterior correlations 
ranging between 0 and 0.9. Each such model contains N = lOn parameters. Again employing 
multivariate normal distributions, we induce nine disjoint groupings of n parameters each, having 
intra-group pairwise correlations of 0.1, 0.2, ..., and 0.9. The remaining n parameters are fully 
uncorrelated. Three such models of this structure are constructed, using the values n = 2, 5, and 
10. As in the previous models, these do not include any likelihood term. 


Random Effects Model 


We select the “litters” model from among the original example models provided with the MCMC 
package WinBUGS. This random effects model contains two groups of 16 binomial observations. 
Within each group, the binomial probabilities are modeled as random effects arising from a beta 
distribution. The particular parameterization of the beta distributions (in terms of a and /?, rather 
than /i and a) results in strong correlations between each aj,/?* pair. The WinBUGS manual 
comments upfront that this model exhibits slow mixing. We consider an informed MGMG algorithm, 
which blocks each ai, jSi pair. In addition, the beta-binomial conjugacy relationships permit use of 
cross-level sampling, where we jointly sample top-level parameters and conjugate latent states, as 


used by Rue and Held (2005, p. 141-143) 


Auto-Regressive Model 

We select the “ice” model from among the examples provided with WinBUGS as an auto-regressive 


(AR; Harvey, 1993) example, which is also analyzed in Breslow and Glayton (1993). The data 
contains 77 incident counts of breast cancer occurring in Iceland, which are modeled as Poisson 
counts. Explanatory variables include age group, year of birth (represented using 11 cohorts ranging 
between 1840 and 1949), and the total person-years for the subjects in each group. The model uses 
second-order AR smoothing of birth cohort effects. 
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Linear Gaussian State Space Models 


We construct two linear Gaussian state space models (Durbin and Koopman, 2012) each consisting 


of 100 latent states and observations. State transitions are governed by a first order AR process, and 
we seek inferences about the transition process, and the system and observation noise. We consider 
two equivalent parameterizations of the state transition process. First, in terms of the intercept and 
mean of the AR process, which have largely uncorrelated posteriors (independent parameterization), 
and second, in terms of the intercept and autocorrelation, which are known to be highly correlated 
(correlated parameterization). For the correlated parameterization, we consider an informed MCMC 
algorithm, which blocks the intercept and autocorrelation parameters. We deliberately include this 
inferior parameterization, to assess MCMC performance in the case of known strong posterior 
correlation. In practice, an analyst may not know which model parameterization(s) will produce 
uncorrelated posterior dimensions. 

Spatial Model 

We consider a spatially dependent hierarchical model. The data consist of 148 measurements of 
scallop abundance at various locations off the New York and New Jersey coastline, and was collected 
by the Northeast Fisheries Science Center of the National Marine Fisheries Service in 1993. The 
data set is publicly available at http: //www.biostat .umn.edu/~brad/data/myscallops . txt, and 


is analyzed in Banerjee, Carlin, and Gelfand (2003), pages 44-65. Following Banerjee, Carlin, and 


Gelfand (2003), we model the mean log-abundance as multivariate normal with covariance that de¬ 


cays exponentially as a function of distance. The covariance is given by coY^gi, gj) = exp{—dij/p), 
where the observations are modeled as Poisson counts yi ~ Poisson(exp( 5 fj)), and dij is the distance 
between observations and yj. Since this covariance structure induces a trade-off between a and 
p, we expect these parameters to be correlated in the posterior distribution. 

Generalized Linear Mixed Model 


We include a reasonably sized generalized linear mixed model (CLMM; Celman and Hill, 2006 


chapter 6). We make use of the Minnesota Health Plan dataset available in Waller and Zelterman 
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(1997) and follow the analysis of Zipnnnikov and Booth (2006). The dataset contains 968 connts 
of senior-citizen clinical visits, which are modeled as Poisson connts. The linear predictor contains 
hxed and random effects, nsing a variety of covariates and inclnding several interaction terms. 


4.3 Performance Results 

We present three qnantities to gange the performance of MCMC algorithm T. Rather than algo¬ 
rithmic efficiency 4(T), for convenience of interpretation we present the proportional qnantity ESS 
= 10,000 A(T), where ESS denotes effective sample size. This scaling of 4(T) has a natnral inter¬ 
pretation as the nnmber of effective samples (for the slowest mixing parameter) which resnlt from 
a chain of 10,000 MCMC samples. Similarly, to represent the compntational reqnirement C'(T), we 
present the proportional qnantity Rnntime = 10,000 C'(\h), interpretable as the time (in seconds) 
reqnired to generate 10,000 MCMC samples. We directly present the overall MCMC efficiency as 
Efficiency = ESS / Rnntime = 74(\I/)/C(T) = E(T), which is independent of any scaling, and 
maintains the intnitive interpretation as the nnmber of effective samples generated per second of 
algorithm rnntime (again, for the slowest mixing parameter). MCMC sampling is performed nsing a 
fixed random nnmber seed and identical initial valnes for each model, so identical MCMC algorithms 
will prodnce identical sample chains, and hence ESS, bnt not necessarily Rnntime or Efficiency on 
acconnt of discrepancies in algorithm rnntime. We observe the antomated procednre prodncing 
the same MCMC algorithm across repeated experiments, with nnmerical resnlts for Rnntime and 
Efficiency varying less than 5% from those presented herein. 

For each example model M, we present resnlts for MCMC algorithm Tbiock denoted as “All 
Blocked,” and those of Tgcaiar as “All Scalar,” noting that Tscaiar also represents the initial state 
(0*^ iteration) of the antomated blocking procednre. The maximally efficient algorithm generated 
via antomated blocking is presented as “Anto Blocking,” which will generally represent a dynam¬ 
ically determined blocking scheme. We also present a third static MCMC algorithm, which is not 
necessarily a member of on acconnt of the possible nse of conjngate sampling. This algorithm 
assigns block samplers to gronps of parameters arising from mnltivariate distribntions, scalar sam¬ 
plers to parameters arising from nnivariate distribntions, and assigns conjngate samplers whenever 
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the structure of M permits; this static algorithm may be more representative of default MCMC 
algorithms provided by software packages, and is denoted as “Default.” Finally, for several example 
models we include an informed blocking of the model parameters, based upon expert or prior knowl¬ 
edge, which is referred to as “Informed Blocking.” Results for the random effects model also include 
the “Informed Cross-Level” MCMC algorithm which makes use of cross-level sampling, which is 
not in 

Varying Size Blocks of Fixed Correlation 

The left pane of Figure displays the Efficiency performance for the model structures containing 
varying sized blocks of fixed correlation. For p = 0.2, the Auto Blocking algorithm selects cut height 
h = 0, which corresponds to re-selecting the algorithm All Scalar. Since this MCMC algorithm is 
identical to the initial state, the automated procedure terminates there. The All Blocked scheme 
actually runs faster, but the algorithmic efficiency loss inherent to large block sampling dominates, 
resulting in Efficiency approximately four times lower. For larger values of p, the All Scalar algo¬ 
rithm suffers progressively more since it fails to institute any blocking in the presence of increasing 
correlations. For p = 0.5 and 0.8, Auto Blocking algorithm selects cut heights h = 0.6 and h = 0.3, 
respectively, which each exactly place all correlated terms into sampling blocks. In every case, the 
slowest mixing parameter is from among the largest correlated group of 32 parameters. 

Fixed Size Blocks of Varying Correlation 

The right pane of Figure presents results for the model structure containing fixed size parame¬ 
ter groupings with correlations between 0 and 0.9. For each size model, the automated blocking 
procedure selects a particular cut height (and hence, MCMC algorithm) twice consecutively, thus 
terminating on the third iteration. The cut heights selected for models N = 20, 50, and 100 are h = 
0.5, 0.8, and 0.9, respectively (not shown), progressively pushing more of the correlated parameter 
groupings into sampling blocks. The Auto Blocking algorithm produces increases in Efficiency by 
factors of 4.5, 7, and 21 in the three models, over the static All Scalar and All Blocked algorithms. 
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Correlation Model size (N) 

Figure 3: Efficiency results for two contrived model structures: varying sized blocks of fixed corre¬ 
lation (left), and fixed sized blocks of varying correlation (right). 

Random Effects Model 

In the random effects model (Table [^, automated blocking generates an MCMC algorithm identical 
to the Informed Blocking algorithm (blocking each Oj, /?* pair), which produces a tenfold improve¬ 
ment in Efficiency over the most efficient static algorithm - for this model. All Scalar sampling. 
The cut height h = 0.1 indicates that only the Oj, f3i pairs exhibit posterior correlations above 0.9. 
The Informed Cross-Level algorithm requires a substantially longer Runtime and produces a high 
ESS, which results in nearly identical Efficiency as the efficiently blocked Auto Blocking algorithm. 

Auto-Regressive Model 

In the auto-regressive model (Table [^, an AR process value exhibited the slowest mixing under 
All Scalar sampling. When all 24 model parameters (AR process values, hxed effects, and one 
hyper-parameter) are blocked, the algorithm Runtime is nearly halved. This decrease in Runtime is 
largely due to the dependency structure inherent to the AR process. Scalar sampling of AR process 
values requires nearly a three-fold increase in density evaluations of the process values (since it’s a 
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second-order AR process) relative to All Blocked sampling. In addition to the improved Runtime, 
the All Blocked sampling of the correlated AR process values increases their individual algorithmic 
efficiencies, and the slowest mixing parameter is among the hxed effects. The Efficiency under All 
Blocked sampling is over double that of All Scalar sampling. The automated blocking procedure 
identihes a blocking scheme which blocks together all AR process values and hxed effects (23 total; 
cut height h = 0.4), and performs univariate sampling of the single hyper-parameter. This has a 
similar Runtime to All Blocked sampling, but increases algorithmic efficiency for all parameters. 
The resulting overall Efficiency under the Auto Blocking MCMC algorithm is over three times that 
of All Scalar sampling. 


Model 

MCMC Scheme 

ESS 

Runtime 

Efficiency 


All Blocked 

0.4 

0.29 

1.3 


Default 

1.1 

1.19 

1.0 

Random 

All Scalar 

2.1 

0.51 

4.2 

Effects 

Informed Blocking 

19.0 

0.50 

38.2 


Informed Cross-Level 

101.3 

2.64 

38.5 


Auto Blocking 

19.0 

0.48 

39.2 


All Blocked 

8.9 

0.3 

27.3 

/WITjO” 

All Scalar 

6.5 

0.6 

11.5 

Regressive 

Auto Blocking 

12.7 

0.3 

37.5 


All Blocked 

0.3 

0.8 

0.4 

State Space 

Default 

27.6 

4.6 

6.0 

Independent 

All Scalar 

20.2 

1.3 

15.7 


Auto Blocking 

29.1 

1.3 

22.4 


All Blocked 

0.6 

0.7 

0.8 


Default 

1.7 

4.9 

0.4 

State Space 

All Scalar 

1.1 

1.3 

0.8 

Correlated 

Informed Blocking 

18.4 

1.2 

15.6 


Auto Blocking 

26.1 

1.2 

20.9 


All Blocked 

0.2 

5.71 

0.04 

Spatial 

Default 

0.4 

10.86 

0.04 

All Scalar 

171.3 

83.87 

2.0 


Auto Blocking 

1208.0 

78.62 

15.4 


All Blocked 

2.2 

44.3 

0.05 

GLMM 

All Scalar 

60.9 

22.6 

3.0 


Auto Blocking 

60.9 

22.6 

3.0 


Table 1: MCMC performance results for the suite of example models. Effective sample size (ESS) 
is measured in effective samples per 10,000 iterations. Runtime is presented as seconds per 10,000 
iterations, and Efficiency is in units of effective samples produced per second of algorithm runtime. 
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Linear Gaussian State Space Models 


Table presents results for both parameterizations of the state space model. In the Independent 
parameterization, the observation noise parameter is the slowest mixing, in all except the All Blocked 
algorithm. The All Blocked algorithm runs quickly, but is limited by the extremely low algorithmic 
efficiency of the AR process intercept parameter. The Default algorithm assigns conjugate normal 
samplers to each latent state, resulting in high algorithmic efficiency but a substantially longer 
Runtime, which diminishes the overall Efficiency. Auto Blocking (cut height h = 0.8) creates 
a block of six parameters containing Eve latent states and the observation noise, and a disjoint 
block of the two AR process parameters. This combination, unlikely to be discovered though any 
combination of prior knowledge or trial and error, produces a 40% increase in Efficiency over All 
Scalar sampling, which is the most efficient static MCMC algorithm. 

We suspect the intercept and autocorrelation parameters of the AR process to be correlated in 
the naive parameterization of the state space model. The All Blocked algorithm once again runs 
quickly, but is limited by the ESS of the AR process intercept. The Default algorithm is again 
slow due to conjugate sampling, but similar to the All Scalar algorithm, produces low algorithmic 
efficiency of the correlated AR process parameters. The Auto Blocking algorithm (cut height h = 
0.1) selects the same parameter block as in the Informed Blocking algorithm (AR process intercept 
and autocorrelation), and additionally a block containing the observation noise and a latent state. 
The Runtimes are, accordingly, nearly identical, however the ESS of the observation noise, the 
limiting parameter, increases. Automated blocking produces Efficiency over 20 times higher than 
the All Blocked algorithm, which is the most efficient static algorithm, and 25% higher than the 
Informed Blocking algorithm. It is important to note that the automated blocking procedure 
overcame the sampling inefficiencies introduced by this naive parameterization, without requiring 
user intervention. 

Spatial Model 

MCMC performance results for the spatial model (Table display several interesting trade-offs in 
MCMC efficiency. The spatial model contains 148 latent parameters jointly following a multivariate 
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normal distribution, and three top-level parameters that govern this distribution (/r, a, and p). 
As there are no conjugate relationships between parameters, the sole difference between the All 
Blocked algorithm and the Default algorithm is the inclusion of these top-level parameters in the 
large sampling block. Therefore, the five second difference in Runtime can be attributed to three 
fewer multivariate density evaluations (per MCMC iteration) under the All Blocked algorithm. 
However, under either algorithm, the blocked sampling of latent parameters produces extremely 
low ESS values of 0.2 and 0.4 among the latent parameters. The minimal ESS value increases by a 
factor of two when the top-level parameters are removed from the large block sampler, and thereby 
achieve better mixing. 

The All Scalar algorithm frees all latent parameters from block sampling. Each scalar sampler 
requires its own, independent, evaluation of the latent multivariate density, hence the Runtime of the 
All Scalar algorithm increases dramatically. That being said, the ESS values of the slowest mixing 
latent parameters under the All Blocked and Default algorithms both increase to approximately 4000 
(not shown), p is the slowest mixing parameter under the All Scalar algorithm, with ESS increased 
from 139.6 (under the Default algorithm) to 171.3, even though it underwent scalar sampling in both 
cases; this is another example of the slowest mixing parameter affecting the algorithmic efficiency 
of other model parameters. 

The automated blocking procedure selects cut height h = 0.1, which produces a single block 
containing p and a; this indicates an empirical posterior correlation of at least 0.9 between these 
parameters. The ESS of p increases to approximately 1500 (not shown). A latent parameter once 
again produces the slowest mixing with ESS of 1208, which produces nearly a tenfold increase 
in Efficiency relative to the All Scalar algorithm. The Runtime of the Auto Blocking algorithm 
decreases slightly compared to the All Scalar algorithm, since the single block sampler induces one 
fewer evaluation of the latent multivariate density. 

Generalized Linear Mixed Model 

We first note that our GLMM model is by far the largest example considered, containing nearly 
2000 stochastic model components (including observations); so we anticipate comparatively low 
MCMC Efficiencies regardless, since MCMC algorithms simply take time to carry out all model 
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calculations. For this model (Table [^, the automated procedure converges on the All Scalar al¬ 
gorithm, which is the same as its initial state, and which produces overall MCMC Efficiency of 
about 3. In hindsight this result may not surprise us, since the fully exchangeable nature of the 
random effects in this model does not induce correlations among the sampled parameters for this 
particular dataset. Correspondingly, for a large number of un-correlated random effects, and in the 
absence of multivariate distributions, univariate sampling produces the highest Efficiency. We also 
note that the All Blocked algorithm, which consists of a single block sampler of dimension 858, has 
Runtime approximately twice that of the All Scalar algorithm, and produces an overall Efficiency 
of approximately 0.05. 



Figure 4: Efficiencies of MCMC algorithms for the suite of example models. 


Efficiency Gains from Automated Blocking 

In Figure]^ we present the overall Efficiencies achieved for our suite of example models (excluding 
the two contrived model structures). The Auto Blocking algorithm consistently out-performs any 
static algorithm in terms of Efficiency, ranging between roughly a 50% increase to several orders of 
magnitude of improvement. The exception is the GLMM example, in which Auto Blocking matches 
the All Scalar algorithm identically. We observe variation in the relative Efficiencies among the 
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static algorithms, reinforcing our notion that overall MCMC efficiency is highly dependent upon 
hierarchical model structure, and attempting to infer what might be an efficient MCMC algorithm 
for a particular problem is, in general, difficult. 


5 Discussion 


We have presented a general automated procedure for determining an “efficient” MCMC algorithm 
for hierarchical models. Our procedure is a greedy, iterative algorithm, which traverses a finite and 
well-defined set of MCMC algorithms. This is the first such automated MCMC-generating proce¬ 
dure of its kind, so far as we are aware. Using a suite of example models, we have observed that our 
automated procedure generates improvements in efficiency (relative to static MCMC algorithms) 
ranging between one and three orders of magnitude. In each case, the automated procedure pro¬ 
duced an MCMC algorithm at least as efficient as any model-specific MCMC algorithm making use 
of prior knowledge or expert opinion. In all examples, our iterative procedure terminated within 
four iterations, although it is plausible that for more complex models it would proceed longer. 

Our study has been confined to a single dimension of a much broader problem. We have strictly 
considered combinations of scalar and blocked adaptive Metropolis-Hastings sampling, with a small 
number of exceptions only for the purpose of comparison {e.g., the use of conjugate sampling). No 
less, we have restricted ourselves to non-overlapping sampling: each model parameter may only be 
sampled by a single MCMC sampler function. We may instead view the domain of our problem (au¬ 
tomated determination of an efficient MCMC algorithm) as a broader space of MCMC algorithms. 
This space may permit a wide range of sampling algorithms not considered herein: auxiliary vari¬ 


able algorithms such as slice sampling (Neal, 2003), or derivative-based sampling algorithms such 
as Hamiltonian Monte Carlo (Duane et ah, [1987 ), among many possibilities. The resulting com¬ 
binatorial explosion in the space of MCMC algorithms makes any process of trial-and-error, or an 
attempt at comprehensive exploration, futile. It is for this reason we seek to develop an automated 
procedure for determining an efficient MCMC algorithm, which may not be globally, maximally 
efficient, but provides non-trivial improvements in efficiency, nonetheless. 

It should be noted that aspects of the problem addressed herein superficially resemble, but 






are fundamentally different in nature from hierarchical clustering, or sparse covariance estimation. 
Granted, our automated procedure firstly utilizes an empirical covariance matrix generated from 
MCMC sampling chains. However, whereas sparse covariance estimation seeks to estimate the non¬ 


zero elements of the underlying covariance structure (Cai and Liu, 2011), our procedure concerns the 
non-trivial (correlated) elements, with little concern for the smaller entries. Our blocking algorithm 
also makes use of the complete linkage clustering algorithm, for determining groupings of correlated 
model parameters. Clustering algorithms have been applied to a wide variety of problems (Xu and 


Wunsch, 2005), but not to parameters of hierarchical models specifically with the aim of accounting 
for trade-offs between MCMC algorithmic efficiency and computational requirements, to produce 
a computationally efficient MCMC algorithm. This is a fundamentally different goal than merely 
producing groupings of “similar” parameters (given some measure of similarity), as is generally the 
goal in most clustering applications. Thereby, the existing literature on these subjects is related, 
but not intimately applicable to our problem at hand. A deeper consideration of these topics may 
be worthwhile, but we consider it beyond the scope of this paper. 

Reasonably straightforward improvements could be made to our automated blocking procedure, 
which is presented as a sensible first approach that addresses the factors affecting MCMC algorithm 
efficiency. By design, our procedure natively accounts for differences in system platform or architec¬ 
ture that may affect the relative efficiencies of MCMC algorithms. We can envision a wide variety 
of possible extensions to our algorithm, ranging from only re-blocking the slowest mixing parameter 
on each iteration, to permitting cuts at different heights on distinct branches of the hierarchical 
clustering tree. Our procedure is intended as a proof-of-concept for the automated generation of 
efficient MCMC algorithms, and to serve as a starting point for subsequent research. 
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